perf(aiur): Canonical levels, memoized dispatch, flat query record#441
Open
samuelburnham wants to merge 8 commits into
Open
perf(aiur): Canonical levels, memoized dispatch, flat query record#441samuelburnham wants to merge 8 commits into
samuelburnham wants to merge 8 commits into
Conversation
Replace the per-circuit FxIndexMap<Vec<G>, QueryResult> with flat arenas: keys/outputs/multiplicities in contiguous Vec<G>s (fixed stride per circuit) indexed by a hashbrown HashTable<u32>. The record IS the proof witness so entries cannot be dropped — only stored compactly: per-entry RAM drops from ~133 B (two heap Vecs + IndexMap bucket + allocator metadata) to ~54 B, and execution runs ~3x faster wall-clock (no per-call heap allocation, better locality). Entry index == insertion order, preserving the memory-circuit pointer semantics. Circuit stats are bit-identical to the map-based form. Measured on ByteArray.utf8DecodeChar?_utf8EncodeChar_append: at the op-count where the old representation OOM-killed a 249 GB box (12.9B ops, 1.64B entries), the flat form sits at 104 GB vs 246 GB, with bit-identical memory-circuit entry counts. Also adds IX_AIUR_QUERY_STATS=1: periodic (every 2^31 ops) + final dumps of per-function entry counts and retained G-elems, the RAM-attribution diagnostic used to find this and the kernel-side pathologies.
Routing App children through the expr_inst1 dispatch materialized a (child, arg, depth) record entry per distinct arg/depth pair just to return a no-op-substitution child unchanged — 35% of all expr_inst1 entries on instantiation-heavy checks — plus a dispatch row for every real-work recursive call. Guard the App-arm children inline (lbr check, then straight into the walk): no-op subtrees produce zero rows, real work one row instead of two. expr_lbr is keyed on the child alone, so the guard memoizes across arg/depth. App-arm only: guarding the binder arms too blew the walk width 53 -> 141 (nested match-with-call) and regressed FFT +4%; App-only keeps width 94. Measured: foldAdd_2000 FFT 248.6M -> 244.9M; UTF8 record at equal ops (2^31): 265M -> 239.5M entries (-9.7%), expr_inst1 39.1M -> 12.4M (-68%) with bit-identical walk-entry counts.
From ap/kernel 828fb85 (Arthur Paulino): the NbE value domain is defined but referenced nowhere — the live kernel runs on de-Bruijn KExpr; vestigial from an abandoned NbE direction. (The closed-term context normalization from that commit was measured separately and not taken: the per-call expr_lbr probe cost +2.9% FFT on recursor loops for a 0.5% record reduction.)
Three coordinated changes to Const-head whnf dispatch: 1. prim_family(addr) classifies a head address into the one reducer family that could fire on it (nat/str/bitvec/native/decidable; the sets are disjoint). Keyed on the ADDRESS ALONE it memoizes to one row per distinct constant address per run, and whnf_const_head calls at most one family reducer. The previous gauntlet ran every reducer in sequence — 4-6 rows at width 85-92 per Const-head whnf, almost all guaranteed misses. Marginal FFT on the bench suite: -1.3%..-28.3% (Vector.extract_append._proof_2: -28% FFT / -25% wall). 2. The symbolic-Nat offset-stuck check (5dcab7f) moves from a delta-arm probe into try_nat_dispatch's miss path as a cold function (try_nat_offset_dispatch, verdict 2 = "already stuck, do not re-whnf"), with the offset construction shared via mk_nat_offset_stuck (also used by the dbc4177 linear-rec collapse). The probe/inline forms cost +11/+34 circuit WIDTH on hot functions (+0.7% FFT on recursor loops, width charged on every row); the dispatch already has the whnf'd args, so the check is free there. Same semantics, residual +0.057% on one synthetic recursor benchmark. 3. nat_lit_to_ctor_or_self exposes ONE constructor layer (n -> succ(Lit(n-1)), mirroring lean4 inductive.h:91-93) instead of materializing the full succ chain: -2/-5% FFT on Nat.rec-over-literal workloads (Aiur's content-memoization had already collapsed the chain's quadratic re-walk, so the win is the linear constant).
Port the Rust kernel's canonical-form level machinery (level.rs normalize_level / norm_level_eq / norm_level_le, itself a line-by-line port of Lean4Lean's Level.Normalize with the covers-split soundness fix): normalize_aux with imax-path conditioning, phase-2 subsumption, and structural/dominance comparison on canonical forms. level_equal and level_leq now normalize-and-compare; the previous recursive Level.leq mirror with its two-way param-substitution split per Max/IMax — and its helpers level_subst_reduce / level_has_param / level_any_param — is deleted. level_normalize is keyed on the level alone, so each distinct level normalizes once per run. The split was exponential in the number of params and every branch materialized freshly substituted levels. On _private.…SInt.0.Int16.instRxcHasSize_eq the level family was 93% of the record at 2^31 ops (level_subst_reduce 60.8M + level_leq 53.2M + level_max 30.4M entries…), entered through Inductive's ctor-field universe constraint (level_leq), not def-eq's level_equal. Measured: Int16.instRxcHasSize_eq goes from NON-COMPLETING (killed at 178 GB RSS after 21 min, growth not converging) to 2m23s / 17.9 GB / 295.7B FFT. Int8 completes in 21s. The bench suite is unaffected (its level comparisons hit the structural fast path). 297 ixvm tests pass.
Print a constant's direct constant references (literal blobs filtered:
content-pinned, not well-typedness obligations) as the comma-separated
hex list `ix tree canonical` expects, so a single-constant frontier
assumption check is:
ix tree canonical $(ix refs-of <name> --ixe env.ixe)
ix claim check <addr> --asm <root>
ix check --ixe env.ixe --claim <hash>
try_reduce_projection_definition ran per Const-head delta candidate with the SPINE in its memo key, but whether a Defn is a projection wrapper depends only on the constant: 12.7M record entries on the UTF-8 codec check, scaling 39x per SInt width doubling (structure- bundle instances force projections constantly). Split the decision into proj_def_info_of, keyed on the index alone (one row per distinct constant per run), gate the caller on the scalar is_proj_def (let-bound matches only take scalar scrutinees — a tuple-pattern gate compiles but fails toplevel check with "Non-tail match in arbitrary position"), and reach the spine-keyed application only for actual projection-definitions. Measured (297 ixvm tests pass): Int16.instRxcHasSize_eq 189.3B -> 186.8B FFT (-1.3%, wall 1m42s -> 1m39s); mergeSort/Vector/ length_append each improve slightly. The UTF-8-class payoff is larger (12.7M entries -> ~hundreds on that profile).
Two quadratic/constant-factor fixes to check_all_skipping, ported from jcb/fixes (3763356, John C. Burnham): - The assumption-leaf skip set keys on ptr_val instead of the first 4 address bytes: one tree lookup per constant, no per-lookup address load, no confirming address_eq. Sound by the build_addr_pos_map interning invariant (one pointer, one content): a ptr hit implies the address IS a leaf; a de-interned pointer reads as absent and the constant just gets checked — fail-closed. - The iterator walks the addrs list in LOCKSTEP with consts (cur_addrs suffix) instead of list_lookup(addrs, pos) per constant, which re-walked the prefix every iteration — a standalone O(closure^2). Affects sharded checks and assumption-gated claims. 297 ixvm tests; frontier-assumption claim smoke passes. Assessment of the rest of jcb/fixes for Aiur: H-13 is compile-pipeline wall-clock (not Aiur), H-15 is the Rust convenience kernel only. The soundness items (C-1, H-2, H-3, H-12, eta/proj guards) touch Whnf/DefEq/Primitive files this branch has restructured — flagged for merge planning, not ported here.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Kernel and record optimizations. Measured against current
main(same machine,lake exe ix check --ixeover fullInit):Int16.instRxcHasSize_eq: does not complete on main (killed at 120 GB RSS,still growing) → 2m00s / 16.8 GB / 202.6 B FFT cost.
Initconstants,−5.5% on a literal-fold synthetic.
Units: M/B = millions/billions of FFT cost units; GB = memory.
Commits, oldest first:
Aiur: flat append-only QueryMap + env-gated record-size stats
Replaces the per-circuit
IndexMap<Vec<G>, QueryResult>record with flat per-circuitarenas (contiguous key/output/multiplicity vectors + a compact hash index). Same
entries in ~54 bytes instead of ~133; circuit stats bit-identical. ~2.5× less
execution RAM, ~3× faster execution. Also adds the opt-in
IX_AIUR_QUERY_STATS=1record dumps used to find the kernel pathologies below.
IxVM: inline lbr guards on expr_inst1_walk App children
Substitution recorded a memo row even for subtrees with no bound variable. The walk
now skips no-op subtrees inline (application arms only — guarding every arm widens
the hot circuit past the savings). −68% rows in the substitution dispatch circuit on
instantiation-heavy checks.
IxVM: drop dead KValNode/KVal/KValEnv
Removes an unreferenced normalization-by-evaluation value domain. Cleanup only.
IxVM: memoized prim_family dispatch + width-safe offset-stuck placement
Const-head whnf ran five primitive reducers in sequence on every head. A classifier
keyed on the constant's address (memoized once per constant per run) names the one
family that could fire; dispatch calls at most that reducer. Also moves the
symbolic-Nat offset-stuck check off the hot path into the Nat reducer's miss path as
a cold function. Largest single dispatch win on the suite; supersedes #438's boolean
gauntlet gate (audited: identical effective head coverage, and primitive heads skip
the gauntlet walk too).
IxVM: canonical level normalization for level_equal / level_leq
Replaces the recursive universe-level comparison — exponential in the number of level
params from its two-way Max/IMax splits — with normalize-then-compare canonical forms
(as in the Rust kernel and Lean4Lean). Each distinct level normalizes once per run.
This is the completability fix: Int16 went from non-completing (level circuits were
93% of its record) to ~2 minutes. The rest of the suite is unaffected.
Ix CLI: refs-of subcommand
Prints a constant's direct references in the format
ix tree canonicalexpects, so asingle-constant assumption-gated check is a three-command pipeline. Tooling only.
IxVM: idx-keyed projection-definition classification
"Is this definition a projection wrapper" was decided per call site with the argument
spine in the memo key; the answer depends only on the constant. Now keyed on the
constant index (one row per constant per run). Drops that circuit from 12.7 M record
entries to a few hundred on UTF-8-class checks.
IxVM: port jcb/fixes H-14 — ptr_val skip map + lockstep addr cursor
Two sharded-check scaling fixes (from jcb/fixes, John C. Burnham): the assumption-leaf
skip set keys on interned pointers instead of address prefixes, and the constant
iterator walks the address list in lockstep instead of re-indexing per constant —
removing a standalone O(closure²). FFT-neutral on single-constant checks.
Benchmarks
_private…SInt.0.Int16.instRxcHasSize_eq_private…Vector.extract_append._proof_1_private…Vector.extract_append._proof_2Nat.fold_congrInt.emod_emod_of_dvdNat.gcd_commNat.shiftRight_eq_div_powmergeSortBenchString.appendfoldAdd_2000List.length_appendNat.add_commByteArray.utf8DecodeChar?_utf8EncodeChar_append